Abstract

The Content Knowledge for Teaching Mathematics instrument was developed by the Study for 
Instructional Improvement and Learning Mathematics for Teaching projects at the University 
of Michigan to measure elementary school and middle school in-service teachers’
mathematical knowledge for teaching to assist in the evaluation of professional development 
programs for mathematics teachers. This instrument is currently in widespread use among 
colleges and universities for the purpose of evaluating mathematics education programs for 
prospective elementary and middle school teachers. Since this is an “off-label” use of the
instrument, this article examines the reliability of the instrument among this new population
of pre-service teachers. 

Introduction 

One key component of improving the mathematical education of students is to improve the 
knowledge of their teachers. This knowledge for teaching is complex and includes knowledge 
about the subject, the students, the curriculum, classroom management, and so forth. In his 
Presidential Address at the 1985 annual meeting of the American Educational Research 
Association, Lee Shulman laid out a construct regarding this knowledge needed for teaching
(1986). Shulman divided the construct of knowledge for teaching into three major components: 
“(a) subject matter content knowledge, (b) pedagogical content knowledge, and (c) curricular 
knowledge” (Shulman, 1986, p. 9). 

In the realm of elementary mathematics, this subject matter content knowledge would 
coincide with what Liping Ma describes as a “profound understanding of fundamental
mathematics” (Ma, 1999b). It is “going beyond knowledge of the facts or concepts” and 
“understanding the structures” of mathematics (Shulman, 1986, p. 9). In particular, “the teacher 
need not only understand that something is so; the teacher must further understand why it is so, 
on what grounds its warrant can be asserted, and under what circumstances our belief in its 
justification can be weakened or even denied” (Shulman, 1986, p. 9). 

The construct of subject matter content knowledge for teaching elementary mathematics may 
be further divided into common content knowledge and specialized content knowledge (Hill, 
Schilling, & Ball, 2004; Hill, Dean, & Goffney, 2005; Hill, Dean, & Goffney, 2007). Common 
content knowledge is “knowledge that is common to many disciplines and the public at large,” 
while specialized content knowledge is “knowledge specific to the work of teaching” (Hill et al.,
2007, p. 82).

The Content Knowledge for Teaching Mathematics Instrument

Purpose and History: There are currently many programs in the United States focusing on
improving elementary school teachers’ content knowledge for teaching mathematics. The
National Science Foundation’s Math and Science Partnership program and the Department of
Education’s Math and Science Partnership programs sponsor the majority of these programs. With
the funds for these programs comes a requirement for evaluation of the programs.

Because of this demand for instruments to measure the growth of teachers’ mathematical
knowledge for teaching over the course of these professional development programs, the
National Science Foundation’s Math and Science Partnership program has funded several
programs to create such instruments. The Learning Mathematics for Teaching project at the
University of Michigan is one such project, and it has developed a series of instruments
called the Content Knowledge for Teaching Mathematics (CKT-M) instruments.

Since the CKT-M arose in response to a need of large professional development programs,
the development group of the CKT-M instrument determined that the instrument must satisfy
certain requirements. The instrument had to measure large numbers of participants without
requiring a large amount of time or money; be reliable enough to measure accurately the
performance of groups, though not of individuals; and contain linked forms for use as
pretests and posttests (Hill & Ball, 2004; Hill et al., 2004; Blunk, Hill, & Phelps, 2005;
Hill, 2007a; Hill, 2007b; Hill, 2007c).

In addition to professional development programs, many pre-service teacher programs are
also using the CKT-M instrument. Many of these programs have gone through major revisions in
the past few years, partially as a result of No Child Left Behind legislation, which increased the
number of hours of undergraduate mathematics courses required of elementary teachers. These
changes also developed from a report of the Conference Board of the Mathematical Sciences
with recommendations about what mathematical courses and topics should be included in
undergraduate programs designed for future teachers (2001).

Since the reliability of the CKT-M instrument was established using experienced in-service
teachers enrolled in professional development programs (Hill & Ball, 2004; Hill et al., 2004),
reliability information for these instruments is needed for the new, distinct demographic of
pre-service teachers. This article explores the reliability of a single published form, including
each of the three sub-scales corresponding to the content areas of number and operations;
geometry; and patterns, functions, and algebra.

Methodology 

Data Collection and Sample: Over a period of four academic semesters, 424 pre-service
teachers enrolled in mathematics courses for elementary teachers at a large university in the
southeastern United States served as study subjects. The students enrolled in these courses had
already completed a traditional mathematics course, usually college algebra, but had not yet
completed many courses in education and had limited exposure to the elementary classroom.
Since these mathematics courses are prerequisites for many of the education courses involved in
the elementary education major, nearly all of the participants were in their freshman or
sophomore year at the university.

The participants completed the survey instrument during a regularly scheduled class time
within the first three weeks of classes during four consecutive semesters. They received
enough time that all participants were able to complete the instrument within the
class period.

The participants were 97% female and ranged in age from 19 to 35, with over 95% being under
the age of 22. Additionally, 93% described themselves as Caucasian/White, 5% as African
American/Black, and the remaining 2% in other categories.

Instrument: The Content Knowledge for Teaching Mathematics (CKT-M) instrument consists
of multiple-choice questions designed to gain understanding of an individual’s knowledge of
mathematical content in the three areas of number and operations; geometry; and patterns,
functions, and algebra. To give a better idea of the type of items included in these instruments,
an example item in the area of number and operations, chosen from the released items, appears
in Appendix 1. The actual items cannot be shared due to the use agreement for the instrument.

To compare the reliability of the CKT-M instrument between pre-service and in-service
teachers, the analysis used a pre-existing form that contained approximately equal numbers of
items from the three content areas of number and operation; geometry; and patterns, functions,
and algebra. The 2004(B) form was chosen because it has undergone several revisions and
has three reported distinct factors corresponding to three major content areas from pre-service
mathematics courses: number and operation; geometry; and patterns, functions, and algebra
(Hill, Schilling, & Ball, 2004; Hill, Dean, & Goffney, 2007; Schilling, 2007). The only change
from the standardized form is the removal of one item due to a typographical error in some of the
copies.

Form Reliability Analyses: Following the structure of the original CKT-M form, the items
were divided into three distinct sub-scales based upon the mathematical subjects of number and
operation; geometry; and patterns, functions, and algebra. Each of the sub-scales was then
analyzed using a two-parameter item response theory model in MULTILOG (Thissen, 2003) to
correspond with the previous analysis of the form using in-service teachers. The analysis
included determining how well the item response theory model fit the observed data, followed by
comparisons between the models generated using pre-service and in-service teachers of the item
parameters, the instrument’s information and standard error curves, and the marginal reliability
for each of the three sub-scales.

The two-parameter item response theory model generated an item difficulty parameter and an
item discrimination parameter for each of the items in the three sub-scales. The two parameters
for each item generated by the item response theory model are the core of the model and
generate all other results from the model, including the item characteristic, item information,
instrument information, and standard error curves.

Goodness of Fit: In order to verify that the two-parameter model is appropriate for this
instrument, with this population, each of the three sub-scales underwent a goodness of fit
analysis. This involved a comparison of the model’s estimated ability of the subjects with their
measured score using a graphical analysis in addition to a correlation.

In addition to testing the ability of the model to estimate an individual’s ability level, it is
also necessary to verify the ability of the model to estimate participant performance on each
item. The item difficulty and discrimination parameters for each item generate an item
characteristic curve, which estimates how likely individuals at various ability levels are to answer
the item correctly. This item characteristic curve for the i-th item is a logistic curve given by
the equation

$$P_i(\theta) = \frac{e^{a_i(\theta - b_i)}}{1 + e^{a_i(\theta - b_i)}}$$

where $\theta$ is a participant’s estimated ability level, $a_i$ is the discrimination parameter, and
$b_i$ is the difficulty parameter of the i-th item.
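As a minimal sketch of how an item characteristic curve behaves, the following assumes the standard two-parameter logistic form without any additional scaling constant. The function name `icc_2pl` and the item parameter values are hypothetical and are not taken from the study.

```python
import math


def icc_2pl(theta, a, b):
    """Probability of a correct answer under a two-parameter logistic model.

    theta: ability level, a: discrimination, b: difficulty.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


# Hypothetical item (illustrative values only, not study data).
a_example, b_example = 1.2, 0.5
for theta in (-2.0, 0.5, 2.0):
    print(theta, round(icc_2pl(theta, a_example, b_example), 3))
```

At an ability level equal to the difficulty parameter, the modeled probability of a correct answer is one half, which matches the interpretation of the difficulty parameter used in this article.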

For each item, an Average Absolute Standardized Residual determined whether the item
characteristic curve for that item matched the observed percent correct for the subjects at each
estimated ability level (Hambleton, 1991). This Average Absolute Standardized Residual was
then compared to the item’s difficulty and discrimination parameters to determine which types of
items best fit the observed data.
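One way such a residual could be computed is sketched below. It assumes the common definition in which each ability group's observed proportion correct is compared with the model's expected proportion and divided by the binomial standard error of that expectation; the exact procedure used in the study may differ. The function names and the ability groups are hypothetical.

```python
import math


def icc_2pl(theta, a, b):
    """Two-parameter logistic item characteristic curve."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def aasr(bins, a, b):
    """Average absolute standardized residual for one item.

    bins: list of (ability_midpoint, n_subjects, observed_proportion_correct).
    """
    residuals = []
    for theta, n, observed in bins:
        expected = icc_2pl(theta, a, b)
        # Binomial standard error of the expected proportion in this group.
        se = math.sqrt(expected * (1.0 - expected) / n)
        residuals.append(abs(observed - expected) / se)
    return sum(residuals) / len(residuals)


# Hypothetical ability groups (midpoint, count, observed proportion); not study data.
example_bins = [(-0.7, 25, 0.30), (-0.1, 60, 0.45), (0.5, 48, 0.62), (1.1, 20, 0.80)]
print(round(aasr(example_bins, a=1.0, b=0.0), 2))
```

Groups with very few subjects give unstable residuals under this definition, which is one reason to restrict the calculation to ability ranges with enough participants.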

Item Parameters: The item difficulty parameter is the ability level at which half of the
subjects answer the question correctly. Subjects whose ability level is below this difficulty
parameter are likely to answer the question incorrectly, while those whose ability level is above
the difficulty parameter are likely to answer the question correctly. Therefore, for items
appropriate for the sampled population, the item difficulty parameters should fall within about
two standard deviations of the mean. While the BILOG software (Mislevy &
Bock, 1997) used in the analysis of the data collected from in-service teachers restricts the
difficulty parameters to this interval, the MULTILOG software (Thissen, 2003) used in the
pre-service analysis does not have such restrictions.

Since the standard deviation for the difficulty parameters generated with pre-service teacher 
data was as high as 5.24, an independent-measures t-test is unable to measure the difference 
between the parameters generated by the in-service and pre-service teachers. Instead, each item 
is treated as an individual for a related-samples t-test. These parameters were compared for all 
three sub-scales and the full scale. 

The item difficulty parameters are likely different since the pre-service teachers’
mathematical knowledge for teaching is similar to, but not as strong as, that of the in-service
teachers. Therefore, a one-tailed repeated-measures t-test was used to measure the significance of
this difference.

The item discrimination parameter describes how well an item differentiates subjects at that
item’s difficulty level. Mathematically, it is proportional to the slope of the item characteristic
curve at the ability level equal to the difficulty parameter. Theoretically, the item discrimination
parameters should be similar between the pre-service and in-service models, and so a two-tailed
repeated-measures t-test was used to determine significance.
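The item-as-case comparison described above can be sketched as follows. This is an illustration of a related-samples t statistic in which each item contributes one pair of parameter estimates; the function name and the parameter values are hypothetical, and the resulting statistic would still need to be compared against a critical value for the chosen tail and degrees of freedom.

```python
import math
from statistics import mean, stdev


def paired_t(first_model, second_model):
    """Related-samples t statistic, treating each item as one case.

    Returns (t, degrees_of_freedom) for the per-item differences
    first_model[i] - second_model[i].
    """
    diffs = [x - y for x, y in zip(first_model, second_model)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1


# Hypothetical difficulty parameters for the same five items under two models.
in_service = [-1.10, -0.40, 0.20, 0.90, 1.50]
pre_service = [-0.60, 0.10, 0.30, 1.80, 2.40]
t_stat, df = paired_t(in_service, pre_service)
print(round(t_stat, 2), df)
```

Pairing by item removes the large item-to-item spread in the parameters from the comparison, so only the spread of the per-item differences enters the standard error.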

Instrument Information and Standard Error Curves: For each item, the difficulty and
discrimination parameters generate an item information curve from the item characteristic curve
$P_i(\theta)$, given by the formula

$$I_i(\theta) = a_i^2 \, P_i(\theta) \, (1 - P_i(\theta)).$$

These item information curves are added together to create the instrument information curve,
which communicates how much information the instrument provides at various ability levels of
the subjects.

The standard error curve is the “standard deviation of the asymptotically normal distribution
of the maximum likelihood estimate of ability for a given true value of ability” (Hambleton,
1991, p. 95). These two curves are related in that the instrument information curve is the
reciprocal of the square of the standard error curve.
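The relationship between the two curves can be sketched numerically. The example assumes the usual two-parameter logistic item information, the squared discrimination times P(1 - P), summed over items; the function names and item parameters are hypothetical.

```python
import math


def icc_2pl(theta, a, b):
    """Two-parameter logistic item characteristic curve."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta, a, b):
    """Information contributed by one item at ability theta."""
    p = icc_2pl(theta, a, b)
    return a * a * p * (1.0 - p)


def instrument_information(theta, items):
    """Sum of the item information curves; items is a list of (a, b) pairs."""
    return sum(item_information(theta, a, b) for a, b in items)


def standard_error(theta, items):
    """Standard error is the reciprocal square root of the information."""
    return 1.0 / math.sqrt(instrument_information(theta, items))


# Hypothetical four-item scale (illustrative parameters only).
example_items = [(2.0, 0.0), (2.0, 0.0), (2.0, 0.0), (2.0, 0.0)]
print(instrument_information(0.0, example_items), standard_error(0.0, example_items))
```

Because the standard error is the reciprocal square root of the information, a standard error of 0.5 corresponds to an information value of 4.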

For each of the three sub-scales and the full scale, the instrument information and standard
error curves of the model generated using pre-service teacher data were compared to the curves
generated using the in-service teacher data. The curves for each of the sub-scales and the full
scale were graphed on the same axes to allow for easier comparison, even though the ability
levels used on the independent axis are different for the two models.

Marginal Reliability: Even though one of the benefits of item response theory is the ability to
measure an instrument’s reliability for subjects at various ability levels, it is often desirable to have
a single index of reliability for the entire instrument. Along these lines, one defines a marginal
measurement error as

$$\bar{\sigma}_e^2 = \int \sigma_e^2(\theta) \, g(\theta) \, d\theta$$

where $\sigma_e(\theta)$ is the standard error function derived from the instrument information curve and
$g(\theta)$ is the ability distribution of the sample population.

The marginal reliability (Green, 1984; Thissen, 2001) of the instrument is then defined as

$$\bar{\rho} = \frac{\mathrm{Variance}(\theta) - \bar{\sigma}_e^2}{\mathrm{Variance}(\theta)}$$

The marginal reliabilities of the three sub-scales and the full scale were computed using
the pre-service teacher data. These reliabilities were then compared to those generated using the
in-service teacher data (Blunk, Hill, & Phelps, 2005; Hill, 2007a; Hill, 2007b).
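A rough numerical sketch of a marginal reliability calculation is given below. It assumes a standard normal ability distribution with variance 1, approximates the integral by a weighted sum over a grid from -4 to 4, and uses two-parameter logistic item information. These choices, the function names, and the item parameters are all assumptions made for illustration and are not the study's procedure.

```python
import math


def icc_2pl(theta, a, b):
    """Two-parameter logistic item characteristic curve."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def instrument_information(theta, items):
    """Sum of a^2 * P * (1 - P) over all (a, b) items."""
    total = 0.0
    for a, b in items:
        p = icc_2pl(theta, a, b)
        total += a * a * p * (1.0 - p)
    return total


def marginal_reliability(items, lo=-4.0, hi=4.0, steps=801):
    """Approximate (Variance - marginal error) / Variance on a theta grid."""
    width = (hi - lo) / (steps - 1)
    thetas = [lo + k * width for k in range(steps)]
    # Unnormalized standard normal density used as the ability distribution.
    weights = [math.exp(-0.5 * t * t) for t in thetas]
    total_weight = sum(weights)
    # Squared standard error is 1 / information; average it over abilities.
    marginal_error = sum(
        w / instrument_information(t, items) for t, w in zip(thetas, weights)
    ) / total_weight
    variance = 1.0  # assumed variance of the ability distribution
    return (variance - marginal_error) / variance


# Hypothetical 17-item scale with difficulties spread from -2 to 2.
example_items = [(1.0, k / 4.0) for k in range(-8, 9)]
print(round(marginal_reliability(example_items), 3))
```

Under these assumptions, doubling the number of items halves the marginal measurement error, which is one way to see why a full scale tends to have a higher marginal reliability than its sub-scales.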

Results 

Goodness of Fit: The first method used to verify the goodness of fit for the 2-parameter item
response model is to compare the model’s estimates of ability to the individual’s actual score.
For the entire CKT-M scale and the three sub-scales (Number and Operation; Geometry;
Patterns, Functions, and Algebra), one can see from Figure 1 that the estimated abilities track
the actual scores almost perfectly. This is verified by the correlations between estimated ability
and true score being 0.9989 (Full Scale), 0.9984 (Number and Operations), 0.9978 (Geometry),
and 0.9978 (Patterns, Functions, and Algebra).

To verify the goodness of fit of the model for the items, the Average Absolute Standardized
Residual (AASR) was computed for each item using the item parameters generated using the
Full Scale. For the AASR to be meaningful, only ability ranges that include a sufficient
number of participants are included. Using step sizes of 0.2 in the ability estimates, only the
ability range of -0.8 to 1.2 had over 10 participants and was included in the analysis.

Figure 1: Graphs of Abilities versus True Score for the Full Scale and Three Sub-scales

After removing an outlier with an AASR of 3.73 and difficulty parameter of -541 (almost all
subjects answered the item correctly), the AASRs of the items had a mean of 1.20 and standard
deviation of 0.34, with a range of 0.63 to 2.31. The correlation of the AASR with the difficulty
parameter was 0.02 for items whose difficulty is within the range of -2.0 to 2.0 and 0.79 for
items outside this range. This implies that the items whose difficulty parameter lies outside the
range of participants’ ability levels do not fit the model well. Similarly, the correlation between
the discrimination parameter and the AASR of these items was 0.46, with higher discriminating
items not performing as well as lower discriminating items (see Figure 2).

Figure 2: Comparison of the Discrimination Parameter and Average Absolute Standardized Residual of CKT-M Items


Since the 2-parameter item response theory model fits the measured data for the participant 
ability levels and the item parameters, this model is appropriate for evaluating the usefulness of 
this instrument with the population of pre-service teachers. 

Item Parameters: Since the discrimination parameter determines the slope of the item
characteristic curve and describes how well an item differentiates between individuals at the
item’s difficulty level, this parameter should be independent of a population’s mathematical
knowledge level. As expected, there is no significant difference between the discrimination
parameters generated using in-service and pre-service teachers. The results for the Number and
Operation sub-scale (M = 0.06, SD = 0.34), Geometry sub-scale (M = -0.04, SD = 0.44), Patterns,
Functions, and Algebra sub-scale (M = -0.09, SD = 0.46), and the Full scale (M = 0.03, SD = 0.41)
all fell within the range on the two-tailed t-test, for the appropriate degrees of freedom, in which
one fails to reject the null hypothesis that there is no difference.

Unlike the discrimination parameter, the difficulty parameter, which measures the point on
the ability scale at which half of the population answers the item correctly, is expected to vary
according to the population’s overall ability level. For the Number and Operation sub-scale, the
difficulty parameters decreased (M = 1.29, SD = 4.68) between the model using pre-service
teachers and the one using in-service teachers. This reduction was statistically significant,
t(23) = -1.35, p < 0.05, one-tailed. Similarly, the Geometry difficulty parameters decreased
(M = 0.95, SD = 2.89) a statistically significant amount, t(22) = -1.55, p < 0.05, one-tailed. The
Patterns, Functions, and Algebra difficulty parameters decreased at an even higher rate (M = 2.69,
SD = 6.16), which was statistically significant, t(17) = -1.85, p < 0.05, one-tailed. When looking
at the full scale, this decrease of the difficulty parameters (M = -1.56, SD = 4.62) was also
statistically significant, t(64) = -2.72, p < 0.05, one-tailed.
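As a worked illustration of how the reported summary statistics relate to one another, the sketch below recomputes the magnitude of a related-samples t statistic from a mean difference, the standard deviation of the differences, and the degrees of freedom. It assumes that each reported M and SD describes the per-item differences and that the number of items is the degrees of freedom plus one; the function name is hypothetical. Only magnitudes are compared, since the sign depends on the order of subtraction.

```python
import math


def t_from_summary(mean_diff, sd_diff, df):
    """Magnitude of a related-samples t statistic from summary statistics.

    Assumes n = df + 1 paired items.
    """
    n = df + 1
    return abs(mean_diff) / (sd_diff / math.sqrt(n))


# (label, M, SD, df) as reported in the text above.
reported = [
    ("Number and Operation", 1.29, 4.68, 23),
    ("Geometry", 0.95, 2.89, 22),
    ("Patterns, Functions, and Algebra", 2.69, 6.16, 17),
    ("Full scale", -1.56, 4.62, 64),
]
for label, m, sd, df in reported:
    print(label, round(t_from_summary(m, sd, df), 2))
```

Small differences between the recomputed and reported magnitudes can arise from rounding in the published means and standard deviations.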

The decrease in the difficulty parameters fits the hypothesis that the in-service teachers have 
a significantly higher level of mathematical knowledge for teaching than pre-service teachers. 

Instrument Information and Standard Error Curves: The Number and Operation sub-scale 
exhibited the largest difference between the models using pre-service and in-service teacher data 
(See Figure 3). For the majority of participants (within one standard deviation of the mean), the 
instrument provided significantly less information when used with pre-service teachers than with 
in-service teachers. 
Figure 3: Instrument Information Curves

Furthermore, the standard error for the instrument was always above 0.80 when used with
pre-service teachers, while in-service teachers whose ability ranges between two standard
deviations below and one standard deviation above the mean, as computed by the model, had a
standard error of less than 0.75 (see Figure 4). This large difference is likely due to the Number
and Operation sub-scale measuring the mathematical content most common in the elementary
classroom, so the in-service teachers are more likely to have recently worked with the
material contained in this instrument.

Since the items on the Geometry sub-scale are focused on subject matter dealt with in the
later elementary grades and middle school, the graphs from this sub-scale show that the
instrument likely performed better for pre-service teachers than for in-service teachers, since the
pre-service teachers had seen the material more recently. This phenomenon also occurred with
the Patterns, Functions, and Algebra sub-scale to an extent. However, since this sub-scale
included questions regarding exponential growth that almost none of the pre-service teachers
answered correctly, the instrument information curve for the pre-service teacher model was lower.

Figure 4: Standard Error Curves

When all three sub-scales are combined into the full scale, both the instrument information 
curve and the standard error curves for the two models are nearly identical. Furthermore, from 
these graphs (Figure 3 and Figure 4), one can conclude that the full scale instrument is very 
reliable with the standard error below 0.5 for nearly all pre-service and in-service teachers. 

Marginal Reliability: As with traditional reliability, the marginal reliability is a coefficient
between 0 and 1 that measures the proportion of the instrument score that is attributable to the
actual ability level of the participant rather than noise. For each of the three sub-scales, the
marginal reliability is given in Table 1. Since this instrument is designed to differentiate between
groups, often as a pretest and posttest, the reliability indices should be in the range of 0.75 to 0.85
(DeVellis, 1991, pp. 85-86). Therefore, the Geometry sub-scale is the only sub-scale near
appropriate reliability to use for pre-service teachers.


Table 1: Marginal Reliability for Pre-service and In-service Models

Sub-scale                          Pre-service Model   In-service Model
Number and Operations              0.682               0.80
Geometry                           0.717               0.861
Patterns, Functions, and Algebra   0.675               0.757

If one combines the three sub-scales to form the full scale, the reliability of the instrument for
pre-service teachers becomes 0.8545, which is on the upper end of reliability for use at the group
level.

The marginal reliability in each situation above was computed under the assumption that the
sub-scales are composed of independent items within the 2-parameter item response theory
model. In reality, these sub-scales are composed of several testlets whose items are not
independent. Therefore, the marginal reliabilities for the three sub-scales and the full scale are
likely significantly lower (Sireci, 1991).


Discussion 

Changing the population from in-service teachers to pre-service teachers had a large effect on
the item parameters and reliability of the CKT-M form used. The main consequence of this result
is that researchers should not use the reliability information created using data from in-service
teachers when using the CKT-M instrument with pre-service teachers. Researchers should
instead collect data from enough subjects to run a thorough item response theory analysis on
their forms, report the results of that analysis, and make their own forms specifically for the
population of pre-service teachers.

Since the completion of an item response theory analysis is not always possible for every 
project, there is a need to create specific forms with reliability data generated using pre-service 
teachers. As evidenced from the results of this study, this form for pre-service teachers may 
consist of currently developed items from the Learning Mathematics for Teaching item pool, but 
would not be a previously compiled form. Much work still needs to be completed to determine 
which items are most appropriate for pre-service teachers and how many items might be needed 
to have adequate reliability when used with pre-service teachers. 
Appendix 1: Sample Released Item from CKT-M

Ms. Harris was working with her class on divisibility rules. She told her class that a number 
is divisible by 4 if and only if the last two digits of the number are divisible by 4. One of her 
students asked her why the rule for 4 worked. She asked the other students if they could 
come up with a reason, and several possible reasons were proposed. Which of the following 
statements comes closest to explaining the reason for the divisibility rule for 4? (Mark ONE 
answer.) 

a) Four is an even number, and odd numbers are not divisible by even numbers. 

b) The number 100 is divisible by 4 (and also 1000, 10,000, etc.). 

c) Every other even number is divisible by 4, for example, 24 and 28 but not 26. 

d) It only works when the sum of the last two digits is an even number. 